Sahigara et al. Journal of Cheminformatics 2013, 5:27 
http://www.jcheminf.eom/content/5/1/27 



• Journal of 
Cheminformatics 



RESEARCH ARTICLE Open Access 



Defining a novel /c-nearest neighbours approach 
to assess the applicability domain of a QSAR 
model for reliable predictions 

Faizan Sahigara, Davide Ballabio, Roberto Todeschini and Viviana Consonnf 



Abstract 

Background: With the growing popularity of using QSAR predictions towards regulatory purposes, such predictive 
models are now required to be strictly validated, an essential feature of which is to have the model's Applicability 
Domain (AD) defined clearly. Although in recent years several different approaches have been proposed to address 
this goal, no optimal approach to define the model's AD has yet been recognized. 

Results: This study proposes a novel descriptor-based AD method which accounts for the data distribution and 
exploits /c-Nearest Neighbours (kNN) principle to derive a heuristic decision rule. The proposed method is a 
three-stage procedure to address several key aspects relevant in judging the reliability of QSAR predictions. Inspired 
from the adaptive kernel method for probability density function estimation, the first stage of the approach defines 
a pattern of thresholds corresponding to the various training samples and these thresholds are later used to derive 
the decision rule. Criterion deciding if a given test sample will be retained within the AD is defined in the second 
stage of the approach. Finally, the last stage tries reflecting upon the reliability in derived results taking model 
statistics and prediction error into account. 

Conclusions: The proposed approach addressed a novel strategy that integrated the kNN principle to define the AD 
of QSAR models. Relevant features that characterize the proposed AD approach include: a) adaptability to local density 
of samples, useful when the underlying multivariate distribution is asymmetric, with wide regions of low data density; 
b) unlike several kernel density estimators (KDE), effectiveness also in high-dimensional spaces; c) low sensitivity to the 
smoothing parameter k; and d) versatility to implement various distances measures. The results derived on a case study 
provided a clear understanding of how the approach works and defines the model's AD for reliable predictions. 
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Background 

The popularity of QSARs has seen a growth from time 
to time and was complemented by the availability of 
more sophisticated and efficient model development tech- 
niques. This fact was further supported by the consider- 
ation of QSAR predictions for regulatory purposes. To deal 
with risk assessment of chemicals for their safe use, a new 
European legislation - REACH (Registration, Evaluation, 
Authorization and restriction of Chemicals) was approved 
in the recent years [1]. To reduce animal testing and re- 
placing them by cost effective methods, this law encourages 
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the use of QSARs as a possible alternative when enough ex- 
perimental data is not available, provided that the model 
was strictly validated for its regulatory consideration [2]. 

There are several aspects that must be taken into ac- 
count before considering a QSAR model reliable enough. 
In other words, the validity of a model has to be evalu- 
ated. Existing literature has often emphasized upon val- 
idating the QSAR models to reflect their robustness and 
predictive ability. In 2004, following five OECD principles 
for model validation were adopted to validate a QSAR 
model for its regulatory consideration: a) a defined end- 
point; b) an unambiguous algorithm; c) a defined domain 
of applicability d) appropriate measures for goodness-of- 
fit, robustness and predictivity and e) mechanistic inter- 
pretation, if possible [3]. 
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Applicability domain (AD) of a QSAR model defines 
the models limitation in its structural domain and re- 
sponse space. In other words, this principle for model 
validation restricts the applicability of a model to reliably 
predict those test samples that are structurally similar to 
the training samples used to build that model [4-6]. Sev- 
eral approaches were proposed in the past years to de- 
fine the AD of QSAR models. These approaches mainly 
differed in the algorithm used to characterise the AD 
within the descriptor space, where the model can predict 
reliably [7,8]. For instance, some classical approaches 
suggested defining the domain of applicability by a) con- 
sidering the range of descriptors values; b) enclosing the 
training space in a convex hull; c) calculating the dis- 
tance of a query compound from a defined point within 
the models descriptor space and d) estimating the Prob- 
ability Density Function for the given data. All these ap- 
proaches were associated with their own advantages and 
limitations [2,7-10]. From time to time, several ap- 
proaches were proposed that were aimed to be more ef- 
ficient or were thought to overcome several limitations 
of existing approaches. 

This article proposes a new heuristic approach towards 
defining the AD of QSAR models. The basis of this 
novel strategy is inspired from the /c-Nearest Neighbours 
(kNN) approach and adaptive kernel methods for prob- 
ability density estimation (kernel density estimators, KDE) 
[11]. Due to its simplicity and easy implementation, kNN 
had been a preferred choice for several proposed QSAR 
studies [6,12-18]. 

In the classical kNN approach for AD evaluation 
[6,18], average distances of all the training samples from 
their k nearest neighbours are calculated and used to de- 
fine a unique threshold to decide if a test sample is inside 
or outside the models AD (for example, 95th percentile). 
Moreover, in the framework of the probability density 
function estimation, the nearest neighbour method pro- 
vides density estimates depending on the Euclidean dis- 
tance to the /c-th nearest data point [19]. Following the 
same concept, the proposed method tries to integrate the 
kNN principle with the salient features of adaptive kernel 
methods [11], which define local bandwidth factors corre- 
sponding to the training data points and use them to build 
the density estimate at a given point. 

The novelty of the kNN based AD approach proposed 
in this article lies in the overall strategy that is properly 
executed in a three-stage procedure to encapsulate and re- 
flect upon several significant aspects towards model valid- 
ation. Moreover, some features common to most of the AD 
approaches were dealt differently with this approach; for 
instance, rather than defining a general threshold as in all 
the distance-based approaches, each training sample in this 
approach was associated with its individual threshold; in 
order to find an optimal smoothing parameter /<, this 



approach performed a /^-optimization procedure based on 
Monte Carlo validation; additionally, models statistical pa- 
rameters and other relevant aspects were dealt simultan- 
eously to reflect upon the reliability in the derived results. 

To better understand the strategy behind this approach, 
it was implemented on a dataset from the literature. The 
dataset was chosen from the CAESAR project to predict 
the bioconcentration factor (BCF) [20,21], The derived re- 
sults were discussed in comparison with those derived 
from other literature AD approaches. 

Methods 

/c-Nearest Neighbours principle from AD perspective 

The kNN principle basically reflects upon the structural 
similarity of a test sample to the training samples used to 
build that model. In theory, the distance of a query sample 
is considered from its k closest data points in the chemical 
space. Lower distance values correspond to a higher simi- 
larity, while the increasing distances signify higher levels 
of structural mismatch. The k value plays a significant role 
in defining how constraint the approach will be and thus, 
it can be referred to as the smoothing parameter. 

A stepwise execution of the following three stages 
characterises the workflow of this approach: 

1) defining thresholds for training samples 

2) evaluating AD for new/test samples 

3) optimizing the smoothing parameter k 

To allow a better interpretation of the proposed ap- 
proach, results on a two-dimensional simulated dataset 
will be considered throughout the major part of this dis- 
cussion and wherever applicable. As shown in Figure 1, 
this dataset has a cluster of 48 training samples and the 
remaining two training samples (49 and 50) are located 
quite in the extremities of the space with respect to 
these clustered samples. 

Defining thresholds for training samples 

Thresholds have a great influence in characterising the 
AD for reliable predictions; a test sample that exceeds 
the threshold condition is associated with an unreliable 
prediction. 

Like the adaptive kernel methods, instead of defining a 
general unique threshold as seen with several classical 
AD approaches, the proposed approach allocates a set of 
thresholds corresponding to the various training samples. 

For a given value of /<, threshold allocation process can 
be summarised as follows: 

a) First of all, the distances of each training sample 
from the remaining n - 1 samples are calculated and 
ranked in increasing order, n being the total number 
of training samples. This will result in a n x (n -1) 
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Figure 1 Scatter plot of the simulated dataset. 



neighbour table D; an entry D t j of the table 
corresponds to the distance of the /-th sample from 
its /-th nearest neighbour: 

Ai<A'2<...<A>-i 

b) The average distance of each /-th sample from its k 
nearest neighbours is calculated considering the first 
k entries in /-th row of the neighbour table: 

k 

;'=i 



di(k) 



k 

where, l<k< n-l and d^k) < d t (k + 1) 

(1) 



A vector d(/c) of average distance values is then 
derived considering all the samples in the training set. 
Next, a reference value (from now on referred as Ref 
Val), d(k) is determined as follows: 



d(k) = Q3(d(/<)) 

+ 1.5[Q3(d(*))-Ql(d(*))] 



(2) 



where, Ql(d(/c)) and Q3(d(/c)) are the values 
corresponding to the 25th and 75th percentiles in 
the vector d(/c), respectively [22]. 
d) Next, the ordered distances of each /-th training 
sample from all other n-l training samples are 
compared with the Ref Val. If the distance value of 
the /-th sample from its given /-th training 
neighbour (where 1 < j < n-l) is less than or equal 
to the Ref Val, then that distance value is retained, 
otherwise is discarded. The number K t of neighbours 



satisfying this condition, minimum zero and 
maximum being n-l, defines the density of the /-th 
sample neighbourhood: 



K r . {Dij<d(k)} V;:l,«-1 



(3) 



e) Finally, each /-th training sample is associated with a 
threshold t t which defines the width of its 
neighbourhood as: 



M 



(4) 



If no distance value was retained for a given /-th train- 
ing sample (7<Q = 0), then its threshold t t would be theo- 
retically settled to 0, but a pragmatic solution is to set it 
equal to the smallest threshold of the training set. 

The plot in Figure 2 provides with an overview of the 
thresholds for all the 50 samples in the simulated dataset. 
As expected, most of the training samples within the clus- 
ter (for instance, samples 2, 33 and 39) were associated 
with higher K t values. On the other hand, obvious poten- 
tial outliers (samples 49 and 50) had their thresholds equal 
to 0 since they couldn't satisfy the threshold criterion even 
for a single training neighbour (i.e. K t = 0), thus no dis- 
tance values contributed to their threshold calculation. 
Nevertheless, they were associated with the minimum 
threshold equal to 0.42, i.e. the threshold of sample 43. 
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Figure 2 Simulated data set. Thresholds t, vs. number of training 
neighbours K-, plot (k = 12). 
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Evaluating AD for new/test samples 

Until this point, each training sample was associated with 
its individual threshold. The next step will be to character- 
ise the AD which usually relies upon a set of conditions 
that will decide if a given test sample can be associated 
with a reliable prediction or not. 

The criterion used by this approach to associate a 
given test sample to be within the domain of applicabil- 
ity can be summarised below. 

Given a test sample, its distance from all the n training 
samples is calculated and simultaneously, compared to 
be less than or equal to the thresholds associated with 
those training samples. If this condition holds true with 
at least one training sample, the test sample will be con- 
sidered inside the domain of applicability for that model. 
Otherwise, the prediction for that test sample will be 
rendered unreliable. 

More formally, given the training set TR, for each test 
sample the AD decision rule is: 

jeAD iff lieTR: DyZti (5) 

where is the distance between the ;-th test sample 
and the z-th training sample and t t is the individual 
threshold of the latter. In addition, each test/new sample 
will be associated with the number Kj of nearest training 
neighbours for which the previous condition holds true. 
This number can be assumed as a measure of prediction 
reliability; indeed, high values of Kj indicate that the new 
sample falls within a dense training region of the models 
space, while low values of Kj denote that the new sample 
still belongs to the models space, but located in sparse 
training regions. Kj equal to zero rejects the sample as it 
being outside the models AD since no training neigh- 
bours are identified. 

Figure 3 provides with the contour plot for the simu- 
lated dataset derived projecting several data points enough 
to fill the training space. Thresholds were calculated using 
12 nearest neighbours and Euclidean distance. This choice 
of k = 12 nearest neighbours was based on the results 
derived performing an internal /^-optimization, discussed 
later in this article. The space enclosed around the cluster 
represented as black line indicates that all the data points 
within this enclosed region were inside the AD. Thus, this 
region reflects in a way how the AD was characterised for 
this two-dimensional dataset. Area of this enclosed region 
tends to expand or shrink depending upon the number of 
nearest neighbours used for threshold calculation. 

As explained earlier, the extreme outliers in the train- 
ing space will be associated with the number K t of neigh- 
bours equal to zero and the lowest possible threshold in 
the training set. Consider the sample 49 from the simu- 
lated dataset which is an extreme outlier with its thresh- 
old equal to 0.42. If there is a test sample that seems to 
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Figure 3 Simulated data set. Contour plot to demonstrate how 
the AD was characterised. Metric used: Euclidean distance; k = 12. 
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be quite in the vicinity of this potential outlier within 
the descriptor space, the test sample will be associated 
with an unreliable prediction since its distance from 
sample 49 will likely exceed the small threshold. Now, 
consider a case, where the descriptor values for another 
test sample exactly overlap or are very similar to those 
for this potential outlier. In this situation, the distance of 
that sample from the outlier will be less than the thresh- 
old and thus it will be considered within the domain of 
applicability. In theory, this is not wrong because the po- 
tential outlier is still a part of the training space. Practi- 
cally, the approach retains all the training samples to 
characterize the AD but minimizing the role of potential 
outliers in doing so. That's the reason why the first test 
sample was excluded from being reliably predicted while 
the second sample was not. However, for the latter the 
number Kj of nearest training neighbours will likely be 
equal to one indicating that its prediction has some de- 
gree of uncertainty. In conclusion, there exists a relation 
between the defined AD and the impact of training sam- 
ples in characterising it based on their threshold values. 

Optimizing the smoothing parameter k 

Another important aspect is concerning the choice of an 
appropriate smoothing parameter /c, whose theoretical 
range is between 1 and n-1. 

Very low k values will restrict the domain of applic- 
ability in a very strict manner as compared to the AD 
derived opting for larger k values. This is because, an 
opted k value will have a direct impact on the threshold 
calculations which in turn can make it more rigid or easier 
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for test samples to satisfy the threshold criterion. The 
strategy implemented in this article to select an appropri- 
ate k value was performed by Monte Carlo validation, 
maximizing the percentage of the test samples considered 
within the AD, i.e. satisfying AD criterion (Equation 5). 

Box-and-whisker plots (box plots) were produced to 
get an overview of all these derived results. For instance, 
consider the plot in Figure 4 derived for the simulated 
dataset showing percentage of test samples retained within 
the AD with different k values (optimization carried out 
with 20% of samples in the test set and 1000 iterations). 

Median quartile in the middle of the box (marked in 
red) can be referred for all the k values to get a hint 
about how many test samples were retained on average 
during the optimization process for a given k value. The 
top and bottom edges of each box plot (quartiles Q3 and 
Ql) correspond to 75th and 25th percentile, respectively. 
The whisker can extend further from Ql-w(Q3 - Ql) 
until Q3 + w(Q3 - Ql), of 1.5 [23]. The test samples fall- 
ing outside this coverage are considered as outliers and 
are highlighted as V in red. About their usefulness in 
the proposed AD approach, box plots showing limited 
spread and allowing majority of test samples to be 
retained within the AD can be favoured and their corre- 
sponding range of k values can be considered to finally 
opt for the most appropriate k. Additionally, a line plot 
is integrated in the same figure indicating the mean per- 
centage of test samples that were considered within the 
AD for each k value. A simultaneous interpretation of 
both these plots can make it easier for a user to decide 
upon an appropriate k value. 

Figure 4 shows that the spread of the box plots for ini- 
tial k values is quite large. This may have resulted due to 
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Figure 4 Simulated data set. Box-and-whisker plot of test samples 
(%) retained within the AD for different k values during 
/c-optimization. 



the impact of restricted training thresholds that excluded 
several test samples from the AD. With an increase in k 
values, the spread narrowed, however the outliers were 
still present until k = 17. After this point, the box plots 
remained unchanged throughout the plot with no out- 
liers. Similar observations were derived from the mean 
line plot which showed a significant rise initially fol- 
lowed by a stable curve until the first half of the k 
values. The plot didn't show any major changes for the 
second half of the k values. In order to avoid very high k 
values good enough to unnecessarily expand the defined 
AD, a k value of 12 was opted as appropriate k for this 
dataset. The plots dealt earlier (Figures 2 and 3) for this 
dataset were thus derived using this opted k value. 

We also performed an extended analysis on several di- 
verse data sets (results not reported in this paper), to study 
the influence of the smoothing parameter k on models 
AD definition. It was concluded that optimization of k can 
be a time-demanding procedure especially in the case of a 
huge number of samples, but it was also observed that this 
approach is quite insensitive to the smoothing parameter 
/c, except for very small k values which led to the results 
influenced by local noise. Therefore, for many applications 
the optimization of the smoothing parameter can be 
avoided and reasonable results can instead be obtained by 
a fixed k value empirically calculated as n 113 . 

Reflecting the reliability in derived results 

After the AD approach has been applied to the model of 
interest, several features will be taken into account to re- 
flect upon the derived results. Moreover, as stated earlier 
the response domain will be taken into account to address 
the reliability in the results derived by characterising the 
AD of a model in its descriptor space. 

In order to reflect upon a models predictive ability, the 
predictive squared correlation coefficient (Q 2 ) was used. 
Since the test samples excluded from the models AD are 
unreliably predicted, in theory they should not be ac- 
counted for to calculate the models statistics (Q 2 ). 

The following key parameters were evaluated: 

a) Number of test samples retained within the AD. 

b) Q 2 calculated from the test samples retained within 
the AD [24,25]: 



(6) 



where, yj is the measured response value for the ;-th 
sample and yj its predicted value; n TR and n TS 
represent the total number of training and test 
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samples, respectively, and y TR is the mean response 
of the training set. 

c) List of all the test samples considered outside the 
AD. 

d) For each ;-th test sample, the absolute standardized 
error calculated as: 



where, y ; « is the measured value for the ;-th sample 
and jj its predicted value; s Y the standard error of 
estimate derived from the training set. 
e) The information about how many times the 

threshold criterion (Equation 5) is satisfied by each 
test sample, that is, how many training neighbours 
(i.e. Kj) are located at a distance less than or equal to 
their threshold values, from a given test sample. 

In theory, a test sample satisfying the threshold criterion 
several times (i.e. having high Kj) is expected to be pre- 
dicted with higher accuracy. This can be desired since less 
distant training neighbours indicate a higher structural 
similarity of the test sample. On the contrary, a test sam- 
ple satisfying the threshold criterion for no training neigh- 
bours (Kj = 0) indicates that there wasn't any training 
sample similar enough to reliably predict that test sample. 

Results and discussion 

As the case study to derive results with the proposed strat- 
egy, the CAESAR Model 2 to predict bioconcentration 
factor (BCF), which was developed under the EU project 
CAESAR following the REACH requirements, was se- 
lected. It is a Radial Basis Function Neural Network 
(RBFNN) model derived from 378 training and 95 test 
samples [20,21]. The five descriptors used to develop this 
model were calculated using Dragon 5.5 [26]. 

The statistics for this model are summarized in Table 1. 

For comparison purposes, some AD approaches taken 
from literature [2,7-10] were implemented on the se- 
lected case study. Among them, the classical kNN-based 
AD approach [6,18] was implemented by calculating 
average distances of all the training samples from their 5 
nearest neighbours (i.e. k = 5); since the choice of 



Table 1 Summary of model statistics for the case study 



Model 




Training set 


Test set 






R 2 RMSE n TS 


Q 2 RMSEP 


CAESAR Model 2 


378 


0.804 0.591 95 


0.797 0.600 



R 2 Determination coefficient; RMSE Root-mean-square error; Q 2 Predictive 
squared correlation coefficient; and RMSEP Root-mean-square error 
of prediction. 



thresholds didn't follow any strict rules in the existing 
literature, the value corresponding to 95th percentile in 
this vector of average distances was considered as gen- 
eral threshold. If the average distance of a test sample 
was lesser than or equal to the threshold value, the test 
sample was retained within the AD. 

In addition to the classical kNN-based AD approach, 
the following methodologies were considered [2,7-10]: the 
Bounding Box, which is based on the ranges of model var- 
iables; its variant based on principal components instead 
of the original variables (PCA-Bounding Box); the Convex 
Hull, which is the smallest convex area that contains the 
original set; two distance-based methods, which calculate 
the distance (Euclidean and Mahalanobis) of a test sample 
from the data centroid and use the 95th percentile of the 
training sample distances as threshold. 

Finally, some methods for probability density function 
estimation were also considered. Among the multivariate 
kernel density methods, four variants of Gaussian kernel 
estimators were implemented [19]: fixed Gaussian kernel 
with bandwidth equal to 0.462 (for the studied data set); 
optimized Gaussian kernel with a smoothing parameter 
equal to 0.237 obtained by leave-one-out cross-validation 
[27]; variable Gaussian kernel with bandwidth calculated as 
the inverse function of the Euclidean distance to k-th neigh- 
bour (k =15) [27]; adaptive Gaussian kernel, with fixed 
Gaussian kernel as the pilot estimate and sensitivity param- 
eter a equal to 0.5. Finally, Epanechnikov kernel with a 
fixed bandwidth equal to 1.961 and the nearest neighbour 
density estimator with smoothing parameter k equal to 15, 
were also considered [19]. 

For all the implemented methods, except for Bounding 
Box and Convex Hull, autoscaling was adopted as data 
pretreatment. 

The proposed AD strategy was implemented in 
MATLAB [28] using autoscaled Euclidean distances. The 
/^-optimization procedure was carried out initially to de- 
cide upon an optimal k value; the training set of 378 sam- 
ples was randomly partitioned 1000 times selecting 20% 
of samples in the test set (i.e. 75 samples). The box plots 
in Figure 5 summarize the percentage of test samples 
retained within the AD for different k values (up to 25). 

As expected, the first lower k values were associated 
with box plots having highest spread. This degree of dis- 
persion lowered gradually with increase in number of 
neighbours considered. The line plot of the mean showed 
an increase in the number of samples throughout the plot, 
however, this increment after initial k values was gradual. 
Based on their lower spread and preference to retain rea- 
sonably higher number of samples within the AD (as 
reflected from their median), the k values in the range of 
15-19 were considered further to decide upon an optimal 
k. Finally, to avoid unnecessarily higher training thresholds 
and their resulting impact on the defined AD, k = 15 was 



Sahigara et al. Journal of Cheminformatics 2013, 5:27 
http://www.jcheminf.eom/content/5/1/27 



Page 7 of 9 




° 123456789 1011 1213141516171819202122232425 

k 

Figure 5 CAESAR BCF model. Box-and-whisker plot of test 
samples (%) retained within the AD for different k values during 
/(-optimization. 



considered as an optimal choice for this case study. Con- 
sidering the selected k value of 15, the novel AD approach 
identified four test samples (33, 61, 82 and 83) being out- 
side the models AD. 

To reflect upon the reliability in the results derived 
with this approach, absolute standardized error of all the 
test samples was plotted against their corresponding Kj 
values. As shown in Figure 6, four test samples consid- 
ered outside the AD with this approach were associated 
with a value of Kj = 0. The absolute standardized error 
for sample 33 was quite higher as compared to the 
remaining three samples. As seen clearly from the plot, 
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Figure 6 CAESAR BCF model. Absolute standardized error of test 
samples plotted against their K- } values. 



there had been a sharp decrease in the prediction error 
of test samples with an increase in Kj. However, it can't 
be denied that this pattern wasn't rigidly followed in the 
results. There had been test samples with very low Kj 
values but extremely low or negligible absolute standard- 
ized error, meaning that even less reliable predictions 
can have good accuracy. In any case, this plot somehow 
tried to interpret the AD results derived in models de- 
scriptor space taking into account the response domain, 
and clearly informed about both, reliability and accuracy 
in the predictions of the test samples. Most of the pre- 
dictions had good accuracy, them being within two units 
of standardized error and high reliability. The samples 
corresponding to these reliable predictions were associ- 
ated with higher Kj values, thus being well represented 
by several training samples. 

Generally, a standardized error of two/three units is 
usually considered as a warning value for outliers detec- 
tion. In Figure 6, six test samples (12, 33, 51, 52, 75 and 
90) exceeded a two-unit threshold for the absolute stan- 
dardized error indicating them as outliers in the models 
response domain. It can be interesting to further evalu- 
ate the reasons behind categorising them as outliers; 
however, this is beyond the scope of this article as the 
proposed AD approach is defined within the models de- 
scriptor space. Nevertheless, this evaluation identifies 
sample 33 as an outlier in models descriptors space as 
well as its response domain which further supports the 
results derived from the proposed approach to exclude 
this sample from the models AD. 

Finally, the results derived by this approach were com- 
pared with those derived from classical AD approaches. 
Table 2 reports these results; the first row shows results 
when no AD approach has been applied to bound the 
models descriptor space. 

The number k of nearest neighbours considered with 
the proposed approach (i.e. 15) was comparatively higher 
than the one considered with classical kNN (i.e. 5); how- 
ever, the impact on model statistics was not so obvious 
on the resulting Q 2 , while the number of retained sam- 
ples increased from 87 (classical kNN) to 91 (proposed 
approach). Discussing the results derived with classical ap- 
proaches, number of samples retained within the AD var- 
ied significantly depending on what strategy was used. 
Convex hull, optimized and variable Gaussian kernel 
methods retained the least number of samples while the 
Bounding Box considered none of the test samples outside 
the AD. Overall, the proposed approach worked quite well 
on the CAESAR model, trying to define an AD with max- 
imum retained test samples within the domain and posi- 
tive impact on the model statistics. 

The last column of Table 2 reports the list of samples 
considered outside the AD with all the approaches. Irre- 
spective of total number of samples considered outside 
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Table 2 Comparison of AD methods applied to the test set of CAESAR BCF model 



Approach 


IN AD 


Q 


OUTSIDE AD 


All samples inside (no AD approach) 


yo 


u./y/ 


None 


Proposed approach (Euclidean dist., k = 15) 


01 

y i 


O QH3 


33 £1 Q~) QQ 
JJ O I OZ OJ 


Bounding box 


yj 


n 7Q7 

u./y/ 


None 


PCA bounding box 


93 


0.804 


33 40 


Convex hull 


73 


0.789 


3 7 9 13 18 33 34 36 37 38 

59 4U 4 I 45 b I bo o I /z /9 y I yz y4 


Euclidean dist (95 percentile) 


QQ 
OO 


U.oUz 


Q QQ QA Q7 /in /I ~) 
J JJ DO D/ 4U 4Z O I 


Mahalanobis dist (95 percentile) 


QQ 

oy 


u./y I 


1 Q /IQ ^/l ^1 QQ Q1 
I o 4b j4 O I ob y I 


np^irpl kNN (f\ irlirlp?in Hkf k — ^ 


87 


0 797 


Q QQ QA AD fb1 R? Q4 
J JJ J^ t-U U I OZ OJ ^ i 


Fixed Gaussian kernel 


85 


0.794 


3 24 33 34 40 61 82 83 91 94 


Optimized Gaussian kernel 


66 


0.831 


3 912 22 24 33 34 38 40 45 47 51 53 54 

56 61 68 69 75 76 80 82 83 87 89 91 93 94 95 


Variable Gaussian kernel {k = 15) 


81 


0.790 


3 24 33 34 40 43 61 80 82 83 89 91 94 95 


Adaptive Gaussian kernel 


88 


0.801 


3 33 43 61 82 83 91 


Fixed Epanechnikov kernel 


87 


0.799 


3 33 40 43 61 83 91 94 


Nearest neighbour density estimator (k = 15) 


91 


0.806 


3 33 61 91 



the AD, all the methods converged significantly identify- 
ing a subset of common samples that were always ex- 
cluded from the models AD. 

Conclusions 

A novel kNN-based approach to define the AD of QSAR 
models was proposed. The overall execution of this ap- 
proach was performed in three different phases that ef- 
ficiently used the salient features of kNN principle to 
define a models AD in its descriptor space. Significant 
features that distinguished the proposed AD approach 
include defining individual threshold for each training 
sample, optimizing the smoothing parameter k to be 
considered and taking into account the models response 
domain to reflect upon the reliability of results derived 
in its descriptor space. 

In the proposed AD method, the appropriate number 
k of neighbours can be chosen on the basis of the plot 
with retained samples vs. k values obtained by Monte 
Carlo validation; it allowed to identify a smoothed region 
of the k values where the results remained unchanged, 
ensuring high robustness in the AD definition. 

The results on the selected case study defined an AD 
with a positive impact on model statistics retaining max- 
imum possible samples that were reliably predicted. 
Comparison of the derived results with those from the 
classical approaches by no means intended to project 
the pitfalls of existing approaches but it was aimed to 
have a performance evaluation of this novel strategy to 
understand how its implementation could lead to obtain 
similar or different results as compared to the classical 



ways of defining the AD. An extended comparison of the 
different AD approaches on several diverse data sets have 
indicated the following relevant features that characterize 
the proposed AD approach: a) adaptability to local density 
of samples, useful when the underlying multivariate distri- 
bution is asymmetric, with wide regions of low data density; 
b) unlike several kernel density estimators, effectiveness 
also in high-dimensional spaces; d) low sensitivity to the 
smothing parameter k; d) versatility to implement various 
distances measures other than Euclidean distance, such as 
Manhattan distance, Mahalanobis distance and the recently 
proposed locally-centred Mahalanobis distance [29], de- 
pending on the data set in analysis. 

A MATLAB module for the models AD estimate by 
different approaches will be soon available at http:// 
michem.disat.unimib.it/chm/. 
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